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IMPROVING SAMC USING SMOOTHING METHODS: THEORY 
AND APPLICATIONS TO BAYESIAN MODEL 
SELECTION PROBLEMS 1 

By Faming Liang 

Texas A&M University 

Stochastic approximation Monte Carlo (SAMC) has recently been 
proposed by Liang, Liu and Carroll [J. Amer. Statist. Assoc. 102 
(2007) 305-320] as a general simulation and optimization algorithm. 
In this paper, we propose to improve its convergence using smoothing 
methods and discuss the application of the new algorithm to Bayesian 
model selection problems. The new algorithm is tested through a 
change-point identification example. The numerical results indicate 
that the new algorithm can outperform SAMC and reversible jump 
MCMC significantly for the model selection problems. The new al- 
gorithm represents a general form of the stochastic approximation 
Markov chain Monte Carlo algorithm. It allows multiple samples to 
be generated at each iteration, and a bias term to be included in the 
parameter updating step. A rigorous proof for the convergence of the 
general algorithm is established under verifiable conditions. This pa- 
per also provides a framework on how to improve efficiency of Monte 
Carlo simulations by incorporating some nonparametric techniques. 

1. Introduction. As known by many researchers, the Metropolis-Hastings 
(MH) algorithm [Metropolis et al. (1953), Hastings (1970)] and the Gibbs 
sampler [Geman and Geman (1984)] are prone to get trapped into local en- 
ergy minima in simulations from a system for which the energy landscape 
is rugged. In terms of physics, the negative of the logarithmic density /mass 
function is called the energy function of the system. To overcome the local- 
trap problem, many advanced Monte Carlo algorithms have been proposed, 
such as parallel tempering [Geyer (1991), Hukushima and Nemoto (1996)], 
simulated tempering [Marinari and Parisi (1992), Geyer and Thompson (1995) 
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evolutionary Monte Carlo [Liang and Wong (2001)], dynamic weighting 
[Wong and Liang (1997)], multicanonical sampling [Berg and Neuhaus (1991)], 
l/A;-ensemble sampling [Hesselbo and Stinchcomble (1995)], the Wang-Landau 
algorithm [Wang and Landau (2001), Liang (2005)], equi-energy sampler 
[Mitsutake, Sugita and Okamoto (2003), Kou, Zhou and Wong (2006)], 
stochastic approximation Monte Carlo (SAMC) [Liang, Liu and Carroll (2007), 
Atchade and Liu (2007)], among others. Henceforth, the work by 
Liang, Liu and Carroll (2007) will be referred to as LLC. 

Among the above algorithms, SAMC is a very sophisticated one in both 
theory and applications. The basic idea of SAMC stems from the Wang- 
Landau algorithm and can be explained briefly as follows. Let 

(1) f(x) = ap(x), xeX, 

denote the target probability density/mass function we are working with, 
where X is the sample space and c is an unknown constant. Let E±, . . . , E m 
denote a partition of X, and let lo{ = f E .ip(x)dx for i = 1, ...,m. SAMC 
seeks to sample from the trial distribution 

(2) /.(x)oc£^M/(xE^), 

where 7Tj's are prespecified constants such that 7Tj > for all i and 2j=i = 
1. It is easy to see that if ui, . . . ,uj m are known, sampling from /^(x) will 
result in a "random walk" in the space of subregions (by regarding each sub- 
region as a "point") with each subregion being sampled with a frequency 
proportional to 7Tj. Hence, the local-trap problem can be overcome essen- 
tially, provided that the sample space is partitioned appropriately. How to 
partition the sample space will be discussed later. SAMC has been applied 
successfully to many hard computational problems, such as phylogenetic 
tree reconstruction [Cheon and Liang (2007)] neural network training [Liang 
(2007)] and Bayesian model selection [LLC (2007)]. 

The success of SAMC depends crucially on the estimation of <jj{. LLC 
propose to estimate u>i simultaneously using a stochastic approximation 
Markov chain Monte Carlo algorithm. Let 0u denote the working estimate 
of log(w.j/7Tj) obtained at iteration t, Ot = (On, ■ ■ ■ ,0t m ), an d let {jt} be a 
positive, nonincreasing sequence satisfying the conditions 

oo oo 

(3) (i) ^2lt = oo, (ii) ^ 7i C <oo, 

t=i t=i 

for any £ > 1. Since f w (x) is invariant to a scale change of to = (lj\, . . . ,uj m ), 
that is, fcui(x) = fuj(x) for any number c > 0, the domain of t 
can be restricted to a compact set by adjusting 9t with a constant 
vector, provided that G is large enough. Refer to Chen (2002) and 
Andrieu, Moulines and Priouret (2005) for more discussions on this issue. 
The SAMC algorithm iterates between the following two steps. 
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SAMC ALGORITHM. 

(a) Simulate a sample xt by a single MH update with the invariant dis- 
tribution 




(b) Set 9* = 9 t + 7t+i(e t - 7r), where e t = (e ti i, . . . , e t)m ) and e t>i = 1 if 
xt G Ei and otherwise. If 0* G 0, set 6t+i = 6*; otherwise, set Ot+i = 6* + 
c* , where c* = (c*,...,c*) can be an arbitrary vector which satisfies the 
condition f + c'e8. 

Under mild conditions, LLC established the convergence of 9 t . The su- 
periority of SAMC in sample space exploration is due to its self-adjusting 
mechanism. If a subregion is visited, 6 t will be updated accordingly such that 
this subregion has a smaller probability to be revisited in the next iteration. 
Mathematically, if x t G Ei, then Ot+i^i <— Qt,% + 7i+i(l — TTj) and &t+i,j 
Gt,i — Jt+lftj f° r j 7^ However, this mechanism has not yet reached its 
maximum efficiency because it does not differentiate between the neighbor- 
ing and nonneighboring subregions of Ei. We note that for many problems, 
E\ , . . . , E m form a sequence of naturally ordered categories with lo± , . . . , u m 
changing smoothly along the index of subregions. For example, for model se- 
lection problems X can be partitioned according to the index of models, the 
subregions can be naturally ordered according to the number of parameters 
contained in each model, and the neighboring subregions often contain sim- 
ilar probability values. Intuitively, xt may contain some information on its 
neighboring subregions, so the visiting to its neighboring subregions should 
also be penalized to some extent in the next iteration. Consequently, this im- 
proves the ergodicity of the simulation. Henceforth, we will call a partition 
with oji , . . . , u> m changing smoothly a smooth partition or say the sample 
space is partitioned smoothly, and assume that there exists a smooth parti- 
tion for the problem under study. 

In this paper, we show that the efficiency of SAMC can be improved by in- 
cluding at each iteration a smoothing step, which distributes the information 
contained in each sample to its neighboring subregions. The new algorithm 
is thus called smoothing-SAMC or SSAMC for simplicity. SSAMC is tested 
through a change-point identification example in this paper. Our numerical 
results show that it outperforms both SAMC and reversible jump Markov 
chain Monte Carlo (RJMCMC) [Green (1995)] for that example. By com- 
paring the sampling mechanisms of SSAMC and RJMCMC, we argue that 
SSAMC can be superior to RJMCMC for the model selection problems for 
which the sample space can be partitioned smoothly. A rigorous proof for 
the convergence of SSAMC is provided in the Appendix. As discussed later, 
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SSAMC represents the most general form of the stochastic approximation 
MCMC algorithm. 

The remainder of this paper is organized as follows. In Section 2, we 
describe the SSAMC algorithm and prove a theorem concerning its con- 
vergence. In Section 3, we illustrate the use of SSAMC through a mixture 
Gaussian example. In Section 4, we apply SSAMC to a change-point identifi- 
cation example. In Section 5, we conclude this paper with a brief discussion. 

2. Smoothing-SAMC algorithm. Suppose that we are working with a 
distribution as specified in (1) and that the sample space X has been parti- 
tioned into m disjoint subregions E\, . . . , E m according to a function denoted 
by A(x). Furthermore, we suppose that the subregions have been ordered 
such that the weights u\ , . . . , u m change smoothly along the index of the 
subregions. 

The SSAMC algorithm is different from the SAMC algorithm in two as- 
pects. First, the gain factor sequence used in SSAMC is a little more re- 
strictive than that used in SAMC. In SSAMC, the gain factor sequence is 
required to be positive and nonincreasing, and satisfy the following condi- 
tions: 



(5) 

fin) 



lim 7 t = 0, (ii) lim \ j t - j t+1 \ < oo, 

oo oo 

7t = °°> ( iv ) It < 00 for any C > 1- 

t=i t=i 



The trade-off is that a higher-order noise term can be included in updating 
9t as prescribed in (21). In this paper, we set 

(6) lt = i = l,2,..., 

max{Jo> tj 

in all computations, where To is a prespecified number. It is easy to see that 
(6) satisfies the condition (5). 

Second, SSAMC allows multiple samples to be generated at each it- 
eration, and employs a smoothed estimate of pu in updating 6t, where 
pu = J E . fe t (x) dx is the probability that a sample is drawn from Ei at 

iteration t, and fe t {x) is as defined in (4). Let . . . , denote the 
samples generated by a MH kernel with the invariant distribution fe t {x). 
Since n is usually a small number, say, 10 to 20, the samples form a sparse 

frequency vector e Xt = (eti, . . . , ej m ) with eu = J2i=iH x t^ e Ei)- Because 
the law of large numbers does not apply here, e Xi /K is not a good esti- 
mator of pi = (pu, ■ ■ ■ ,Ptm)- As suggested by many authors, for example, 
Burman (1987), Hall and Titterington (1987), Dong and Simonoff (1994), 
Fan, Heckman and Wand (1995) and Aerts, Augustyns and Janssen (1997), 
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the frequency estimate can be improved by a smoothing method. Since we 
have assumed that the partition is smooth, information in nearby subregions 
can be borrowed to help produce more accurate estimates of pt- 

In this paper, the frequency estimator e Xt /fc is smoothed by the Nadaraya- 
Watson kernel estimator; that is, pu is estimated by 

m ^ _ J2T=iW(A( i -j)/(mh t ))e tj /K 

U PU ET=iW(A(z-j)/(mh t )) ' 

where is a kernel function with bandwidth ht, and A is a rough es- 

timate of the range of X(x), x G X. Here, it is assumed that has a 

bounded support; that is, there exists a constant C such that W(z) = 
if \z\ > C. Under this assumption, it is easy to show that the deviation 
of pu from the frequency estimate eu/n is of the order 0(h t ); that is, 
Pu — eti/n = O(ht). Refer to the Appendix [around (32)] for the details of 
the proof. We have many choices for W(z), for example, an Epanechnikov 
kernel or a double-truncated Gaussian kernel. The former is standard, and 
the latter can be written as 

exp(-z 2 /2), if|z|<C, 
0, otherwise. 



(8) W(z) 



The bandwidth ht is chosen as a power function of 7t, that is, ht = for 
a > and b > 0. Here b specifies the decay rate of the smoothing adaptation 
in the SSAMC algorithm. For a small value of b, the adaptation can decay 
very slowly. In all computations of this paper, W(z) is set to the double- 
truncated Gaussian kernel with C = 3 and 

where the second term in min{-,-} is the default bandwidth used in con- 
ventional density estimation procedures for continuous observations, for ex- 
ample, S-PLUS 5.0 [Venables and Ripley (1999), page 135]. It is easy to see 
that ht = yj^ft when t becomes large. 

In summary, one iteration of the SSAMC algorithm consists of the fol- 
lowing three steps: 



SSAMC ALGORITHM. 

(a) (Sampling) Simulate samples x.jp , . . . , using the MH algorithm 
with the proposal distribution q(x l f > , ■) and the invariant distribution fo t (x) 
as defined in (4), where x^ = x^ l . 

(b) (Smoothing) Calculate pt = (pti, ■ ■ ■ ,Ptm) in (7). 
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(c) ( Weight updating) Set 

(10) 9* = t + 7t+ i(p t -7T). 

If 9* £ 0, set 6> m = 6>*; otherwise, set 9 t +i = 9* + c*, where c* = (c*, . . . , c*) 
can be any vector which satisfies the condition 9* + c* £ Q. 

For reasons of mathematical convenience, we assume that X is either 
finite (for a discrete system) or compact (for a continuum system). For the 
latter case, X can be restricted to the region {x :ij)(x) > V'min}, where V'min 
is sufficiently small such that the region {x :ip(x) < V'min} is not of interest. 
As in SAMC, O can also be restricted to a compact set. In this paper, we 
set = [— 10 100 , 10 100 ] m , although as a practical matter this is essentially 
equivalent to setting = M. m . Since both X and are compact, it is natural 
to assume that fg t (x) is bounded away from and oo on X. Furthermore, 
we assume that the proposal distribution q(-, •) used in the sampling step of 
SSAMC satisfies the local positive condition; that is, for every x £ X, there 
exists ei and t2 such that q(x,y) > 62 if \\x — y\\ < e±, where ||z|| denotes the 
norm of the vector z. In a study of MCMC theory, the proposal distribution 
is often assumed to satisfy the local positive condition [Roberts and Tweedie 
(1996)]. 

Under the above assumptions, we establish the following result concerning 
the convergence of SSAMC. (A formal statement of this result and the proof 
are given in the Appendix.) As t — > 00, we have 

(11) ff t , _> J Const + log^^ ip(x) dx^j - log(7Tj + v), if Ei ^ 0, 



\ 



-00, if Ei = 0, 



where v = J2j^{i-. E t =&} n j/ ( m ~ m o) an( ^ m o 1S t ne number of empty subre- 
gions, and Const represents an arbitrary constant. Since fg t ( x ) is invariant 
with respect to a location transformation of 6t, Const cannot be determined 
by the samples drawn from fe t {x). To determine the constant term, extra 
information, for example, Ya^=i ^ u is equal to a known number, is needed. 

LLC discussed several practical issues on implementation of SAMC, in- 
cluding sample space partitioning, convergence diagnostic, and parameter 
setting (for 7r, To and the total number of iterations), most of which are still 
applicable to SSAMC. To make the paper self-contained, they are briefly 
discussed as follows. 

The sample space should be partitioned such that the MH chain can mix 
reasonably fast within the same subregion. For example, if one chooses to 
partition the sample space according to the energy function — logV'(^), the 
partition may be done as follows: E\ = {x : — log^(x) < u%}, E2 = {x : u\ < 
— log^(x) < U2}, ...,E m = {x : — logip(x) > u m -i}, with the energy band- 
width Ui — Ui-i (i = 2, .. . ,m) being less than 2, and u% and u m being cho- 
sen appropriately such that the probabilities contained in E\ and E m are 
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ignorable. This partition ensures that the MH moves within the same sub- 
region have a reasonable acceptance rate. For the model selection problem, 
the sample space is usually partitioned according to the model index by as- 
suming that the MH chain can mix reasonably fast in the sample space of 
each model. If this is not true, one may partition the sample space jointly 
according to the energy function and the model index. 

The convergence of SSAMC can be diagnosed by examining the patterns 
of the estimates of u obtained in multiple runs. If the estimates follow the 
same pattern, we may reasonably think the runs have been converged. Oth- 
erwise, we may think the gain factor is still large at the end of the runs, or 
some parts of the sample space have not yet been visited, and some param- 
eters should be reset as described below. LLC also proposed to diagnose the 
convergence of SAMC based on the realized sampling frequencies. This may 
not work well for SSAMC due to its use of the smoothing estimator at each 
iteration. 

The choice of iv is problem dependent. If one aims at optimization, tv 
may be set biased to low energy regions to improve the ergodicity of the 
simulation; whereas, if one aims at estimating lo, 7r may be set to a discrete 
uniform distribution over the subregions. The parameter To and the total 
number of iterations can be determined by a trial and error process based 
on diagnostics for the convergence of the simulations. If a run is diagnosed 
as unconverged, SAMC should be rerun with a larger value of To, a larger 
number of iterations, or both. In general, a complex problem should associate 
with a large value of To and a large number of iterations. 

Below we discuss two more issues specifically related to SSAMC. 

• On the choice of smoothing estimators. Theoretically, any smoothing esti- 
mator, which satisfies the condition pu — eu/ 'k = 0(h T ) for some r > 0, can 
be used in SSAMC. Other than the Nadaraya-Watson kernel estimator, 
the estimators that possibly can be used include the local log-likelihood es- 
timator [Tibshirani and Hastie (1987), Fan, Heckman and Wand (1995)] 
and the local polynomial estimator [Aerts, Augustyns and Janssen (1997)], 
etc. Refer to Simonoff (1998) for a comprehensive review of smoothing es- 
timators. 

• On the choice of k. Since the convergence of SSAMC is determined by the 
three parameters k, Tq and N (the total number of iterations) together, 
we suggest that the value of k should be determined together with the 
values of To and N through a trial and error process as described above. 
In practice, k is usually set to a number less than 20. Since the gain 
factor is kept at a constant in each iteration, a run with a large k has to 
end at a large value of 7t, provided that the total running time is fixed. 
The estimates produced by a run ending at a large value of gain factor 
are often highly variable. In our experience, SSAMC can benefit from the 
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smoothing operation even when k is as small as 5, and the maximum 
benefit is usually attained at a value of k between 10 and 20. 



3. An illustrative example. In this section, we illustrate the use of SSAMC 
through a mixture Gaussian example. Our numerical results indicate that 
SSAMC can converge much faster than SAMC for this example. Consider 
the following distribution: 



/(x) 



1* 



+ IJV 



1 0.9 
0.9 1 

1 

1 



+ ±7V 



1 

-0.9 



-0.9 
1 



which is identical to an example given in Gilks, Roberts and Sahu (1998), 
except that the mean vectors are separated by a larger distance in each 
dimension. Figure 1 shows the contour plot of the distribution, which indi- 
cates that the distribution contains three well-separated components. The 
MH algorithm was first applied to simulate from f(x) with a random walk 
proposal N(x,l2), but it failed to mix the three components. 
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Let X = [— 10 100 , 10 100 ] 2 be compact, and let it be partitioned according to 
the function X(x) = — log/(x) into the following subregions: E\ = {x : X(x) < 
0.5}, E 2 = {x : 0.5 < X(x) < 1.0}, ...,E 44 = {x: 21.5 < X(x) < 22.0} and E 45 = 
{x:X(x) > 22.0}, as illustrated by Figure 1. Note that in simulations, we 
never need to know where the subregions are. It is enough to know which 
subregion it belongs to for any given sample. SSAMC (also SAMC) offers to 
learn the weights J E i/j(x) dx/ir 4 , . . . , J E ip(x) dx/7r m simultaneously via a 
stochastic approximation process. The self-adjusting mechanism of SSAMC 
ensures the success of the learning process; the entire sample space is fully 
explored and the weights converge to their true values. After convergence, 
importance samples can then be simulated from the target distribution f(x). 
Since our purpose of studying this example is just to illustrate how the con- 
vergence of SAMC can be accelerated by the smoothing operator, the issue 
of post-convergence inference will not be discussed here. Refer to LLC for 
this issue. 

SSAMC was run for this example 20 times independently with the setting: 
ip(x) = f(x), T = 25, k = 20, A = 22, m = 45, N = 5 x 10 5 , and tti = • • • = 
= l/m. Table 1 summarizes the estimates of the probabilities P(Ei) = 
Ie f( x ) & x -> ^ = 5, . . . , 10. Note that the subregions E\, . . . , E4 are empty. Like 
SAMC, SSAMC allows the existence of empty subregions in simulations. The 
corresponding true probability values, which are calculated with a total of 
3 x 10 8 samples drawn equally from each of the three components of f(x), 
are also given in Table 1. 

For comparison, SAMC was applied to this example with the same setting 
as that used by SSAMC except for T = 500 and N = W 7 . SAMC was also 
run 20 times independently. The computational results are also summarized 
in Table 1. SSAMC has made a significant improvement over SAMC in 
terms of accuracy of the estimates; in other words, SSAMC converges much 
faster than SAMC. On average, the RMSE (root mean squared error) of 
the SSAMC estimates is only about half of that of the SAMC estimates. 
Note that under the above settings, we have almost the same gain factor 
sequence {7^} and exactly the same number of energy evaluations in each 
of the SAMC and SSAMC runs. Hence, the comparison is fair. The number 
of energy evaluations is actually a much better measure than the CPU time 
for comparing efficiency of two algorithms, because the CPU cost is usually 
dominated by the part used for energy evaluations when we simulate from a 
complex system, for example, protein folding. We note that this measure has 
long been used in statistical physics [see, e.g., Hesselbo and Stinchcomble 
(1995)]. To provide more evidence for the fairness of our comparisons, we 
also reported in Table 1 the CPU times cost by the above runs; SAMC and 
SSAMC cost about the same CPU time in each run. 

To examine the effect of the sample size k on efficiency of SSAMC, SSAMC 
was rerun with k = 5 and 10. The values of Tq and iV were set accordingly, 
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Table 1 



Comparison 


of RMSEs (root mean 


squared errors) 


of the SAMC 


and SSAMC estimates 










SSAMC 




Estimates 


True prob. (%) 


SAMC 


K = 5 


K = 10 


k = 20 




01 fin 
zl.oU 


21.00 


zl.DD 


21.68 




P(Es) 


21.70 


(0.23) 


(0.13) 


(0.09) 


(0.11) 




19.67 


19.67 


19.70 


19.74 




P(E 6 ) 


19.74 


(0.17) 


(0.10) 


(0.08) 


(0.05) 




23.10 


23.10 


23.08 


23.05 




P(Er) 


23.04 


(0.18) 


(0.09) 


(0.08) 


(0.07) 




14.01 


14.01 


13.99 


13.98 




P(Es) 


13.98 


(0.08) 


(0.06) 


(0.04) 


(0.04) 




8.51 


8.49 


8.49 


8.48 




P(E 9 ) 


8.47 


(0.08) 


(0.04) 


(0.02) 


(0.03) 




5.16 


5.15 


5.15 


5.14 




P(E 10 ) 


5.15 


0.04 


(0.02) 


(0.02) 


(0.02) 


CPU (s) 




33.2 


35.6 


34.8 


33.9 



The numbers in the parentheses are the RMSEs of the estimates. The CPU time (in 
seconds) was measured on a 2.8 GHz computer for a single run of the corresponding 
algorithm. 



To = 500/k and N = 10 7 /k, such that in these runs we have about the same 
gain factor sequence and exactly the same number of energy evaluations as in 
the previous runs. The computational results are also summarized in Table 
1. They indicate that the smoothing operation can improve the accuracy of 
the SAMC estimates generally, even when k is as small as 5. 

4. Bayesian model selection problems. LLC applied SAMC to Bayesian 
model selection problems and compared it to RJMCMC. They conclude 
that SAMC outperforms RJMCMC when the model space is complex, for 
example, it contains several modes which are well separated from each other, 
or some tiny probability models, but, of interest to us. However, when the 
model space is simple, for example, it only contains several models with 
comparable probabilities, SAMC may not work better than RJMCMC, as in 
this case the self-adjusting ability of SAMC is no longer crucial for mixing 
of the models. In this section, we show that for Bayesian model selection 
problems, SSAMC can make a significant improvement over SAMC and it 
can also work better than RJMCMC even when the model space is simple. 
This is illustrated by a change-point identification example. 

The change-point identification problem can be stated as follows. Let 
Z = (z\, Z2, ■ ■ ■ , z n ) denote a sequence of independent observations. Assume 
that the index set {1, 2, ... ,71} has been partitioned into blocks and that the 
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sequence follows the same distribution within blocks; that is, there exists a 
binary vector $ = ($1, . . . , $ n _i) with $ C1 = • • • = $ Cfc = 1 and elsewhere, 
such that 



and 



= c < c\ < ■ ■ ■ < c k < Cfc+i = n 



z i~9r(-), C r -i<i<C r 



for r = 1, 2, . . . , k + 1, where g r {-) is a density. Our task is to identify the 
values of ci , . . . , Cfc . 

Recently this problem has been treated by several authors using simulation- 
based methods, such as the Gibbs sampler [Barry and Hartigan (1993)], 
jump diffusion [Phillips and Smith (1996)], reversible jump MCMC [Green 
(1995)] and evolutionary Monte Carlo [Liang and Wong (2000)]. In this ar- 
ticle, we follow Barry and Hartigan (1993) to consider the case where gy(-) 
is a Gaussian density parameterized by (fi r ,a^). Let denote a configu- 
ration of # with k ones, which represents a model of k change-points. Let 

T,W = (tf 

>A t i) cr i) • • • )Atfc+i,ffc + i), Xk denote the space of models with k 
change-points, 0^ G X k , and X = {J 1 k l =oX k . The log-likelihood of rf k ^ is 



fc+i f 



C-i Cj' 1 , n 1 



(12) L(ZtoW) = -£ p-J2=iloga? + ^ £ (*, - W ) 2 . 

1=1 I 4 J=Cj_l + l ) 

To conduct a Bayesian analysis, the following priors are specified for rp*'. 
The vector is subject to the distribution 

( (n-1)! ' 

which is equivalent to assuming that X k is subject to a truncated Poisson 
distribution with parameter A, and each of the (n — l)!/[A;!(n — 1 — fe)!] 
models in X k is a priori equal. The component mean /ij is subject to an 
improper prior, and the component variance of is subject to an inverse- 
Gamma IG(a,(3). By assuming that all the priors are independent, we have 
the log-prior density, 



fc+l 

(13) logPfaW) 



9 P 

(a - l)logof + -5 



where a k = (k + l)[alog/3 — logT(a)] + log(n — 1 — k)\ + /clogA. The a, /3 
and A are hyperparameters to be chosen by the user. The log-posterior of 
fj( k ) ( U p to an additive constant) can be obtained by adding (12) and (13). 
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Integrating out the parameters m, erf, . . . , Hk+i, from the full posterior 
distribution and taking a logarithm, we have 

logP(tfW|Z)=a fc + ^log27r 

2 log(cj - Ci-i) - logT — ^ — - + a 



(14) 




- — — ~ + a 



x log 



2 . 2(c t -c i _i) 



The MAP (maximum a posteriori) estimate of # is often a reasonable so- 
lution to the problem. In practice, we are also interested in estimating the 
marginal posterior distribution P(X k \Z). SSAMC can be applied to estimate 
this distribution. Without loss of generality, we restrict our consideration to 
the models with k min <k< /c max . Let E k = X k and ip(-) oc P(&^\Z). It fol- 
lows from (11) that /u)j = e 6ti ~ 6tj forms a consistent estimator for the 
ratio P(X i \Z)/P(X j \Z) when t is large. 

For the change-point identification problem, the sampling step of SSAMC 

(k I) 

can be performed as follows. Let i9J ' denote the Zth sample generated at 
iteration t, where k indicates the number of change-points of the sample. 
The next sample can be generated according to the following procedure: 

(a) Set j = k — 1, k, or k + 1 according to probabilities q k j, where q kjk = 5 
for k min <k< k max , g fcm . njfcmin+1 = <?fc max ,fc max -i = §, and q k>k+1 = q k ,k-i = 3 

if fc mm < k < ^max 1 

(b) If j = k, update -0^^ by a "simultaneous" move (described below); if 
j = k+ 1, update -d^' 1 ' by a "birth" move (described below); and if j = k — 1, 
update by a "death" move (described below). 

The "birth," "death" and "simultaneous" moves are designed similarly to 
those described in Green (1995). In the "birth" move, a random number, say 
u, is first drawn uniformly from the set {0, 1, . . . , k}; then another random 
number, say v, is drawn uniformly from the set {c u + 1, . . . , c u +i — 1}, and 

it is proposed to set •& v = l. The resulting new sample is denoted by ■dX 

In the "death" move, a random number, say u, is drawn uniformly from the 

set {1, 2, . . . , k}, and it is proposed to set i? Cu = 0. The resulting new sample 

(k—l) 

is denoted by . In the "simultaneous" move, a random number, say 

u, is first randomly drawn from the set {1,2, . . . ,k}; then another random 
number, say v, is uniformly drawn from the set {c u _i + 1, . . . , c u — 1, c u + 
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1, . . . , c u+ i — 1}, and it is proposed to set $ Cu = and = 1. The resulting 

(k) 

new sample is denoted by i?i . The acceptance probabilities of the three 
types of moves are as follows. For the "birth" move, it is 

"** P(#i k+1) \X)q k+1 , k c u+1 -c u -l 



(15) min/l, 



e 9 ^+i p(#ffi\x) Qk,k+i 1 
For the "death" move, it is 



(16) min/l, 



P(#f M\X) 1k,k-l Cu+l ~ C u -i 



For the "simultaneous" move, it is 

(17\ ■ Jl P ^\ X ) 
(1') mm 1, 7777\ 

I P(4 k ' l) \X) 
because the proposal densities are symmetric in the sense 

r(0( fc .o _> ^l fc )) 

T(#i k) - 4 k ' l) ) = V(^+i - cu-l ~ 2). 

Our simulated dataset consists of 1000 observations with z±, . . . , Z120 
iV(-0.5,l), 2121, ••• ,^210 ~ JV(0.5,0.5), 02H, •••, ^460 ~ N(0, 1.5), z 46 i,..., 

^530 ~ N(-l,l), Z 53 i,...,Z 6 i5~iV(0.5,2), Z 6 i 6 ,...,Z 7 io~iV(l,l), 0711, 

z 800 ~iV(0, 1), Zsoi, •••,2950 ~iV(0.5,0.5) and 2951, • ■ • ,21000 ~iV(l, 1). The 
time plot is shown in Figure 2. For this dataset we set the hyperparame- 
ters a = (3 = 0.05 and A = 1. In simulations, we set k m [ n = 7 and k max = 14. 
The values of fc m i n and fc max can be determined rapidly with a short pi- 
lot run of the above algorithm. Outside this range, we have P{Xi\Z) ~ 0. 
SSAMC was run 20 times independently with k = 20, T = 5, N = 10 5 , 
A = fcmax — &min + 1, m = 8, and 7Ti = • • • = 7r m = — . The results are summa- 
rized in Figure 2 and Table 2. 

Figure 2 compares the true change-point pattern and its MAP estimate, 
which are (120,210,460,530,615,710,800,950) and (120,211,460,531,610, 
709,801,939), respectively. The largest discrepancy of the two patterns oc- 
curs at the last change-point position. A detailed exploration of the original 
data gives a strong support to the MAP estimate. The last ten observations 
of the second last cluster have a larger mean value than the expected and 
thus, they tend to be clustered to the last cluster. Our computation shows 
that the log-posterior probability of the MAP estimate is 5.33 higher than 
that of the true pattern. 

For comparison, SAMC and RJMCMC were also applied to this exam- 
ple. Each algorithm was run 20 times independently. The computational 
results are summarized in Table 2. SAMC employs the same sample space 
partition, the same transition proposals (in the sampling step) and the same 
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Time 

Fig. 2. Comparison of the true change-point pattern (horizontal lines) and its MAP 
estimate (vertical lines). 

parameter setting as SSAMC except for T = 100 and N = 2 x 10 6 . RJM- 
CMC employs the same transition proposals as those used by SSAMC and 

Table 2 

The estimated posterior distribution P(Xk\Z) for the change-point identification example 



SSAMC SAMC MSAMC RJMCMC 



k 


prob(%) 


SD 


prob(%) 


SD 


prob(%) 


SD 


prob(%) 


SD 


7 


0.1010 


0.0023 


0.0944 


0.0029 


0.0978 


0.0020 


0.0907 


0.0046 


8 


55.4666 


0.2470 


55.3928 


0.6112 


55.0810 


0.3507 


55.5726 


0.3451 


9 


33.3744 


0.1659 


33.3728 


0.3573 


33.3798 


0.2228 


33.2117 


0.2052 


10 


9.2982 


0.1026 


9.3647 


0.2788 


9.5903 


0.1351 


9.3537 


0.1441 


11 


1.5655 


0.0287 


1.5785 


0.0685 


1.6457 


0.0304 


1.5694 


0.0400 


12 


0.1768 


0.0042 


0.1803 


0.0097 


0.1871 


0.0042 


0.1845 


0.0097 


13 


0.0157 


0.0005 


0.0154 


0.0009 


0.0166 


0.0004 


0.0165 


0.0011 


14 


0.0018 


0.0001 


0.0011 


0.0001 


0.0018 


0.0001 


0.0009 


0.0002 


CPU (s) 


25.8 




25.5 




24.9 




23.9 





SD: standard deviation of the estimates. CPU: the CPU time (in seconds) cost by a single 
run of the corresponding algorithm on a 2.8 GHz computer. 
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SAMC, and performs 2 x 10 iterations in each run. Therefore, in each run 
the three algorithms perform exactly the same number of energy evalua- 
tions, and SAMC and SSAMC also employ the same gain factor sequence. 
The comparisons made in Table 2 are thus fair to each of the algorithms. 

SSAMC works best for this example among the three algorithms. As 
known by us, RJMCMC can be regarded as a general MH algorithm; and 
for such a simple problem, it is really hard to find another Monte Carlo al- 
gorithm to beat it. However, SSAMC does. SSAMC is different from RJM- 
CMC in two respects. First, like SAMC, it has the capability to self-adjust 
the acceptance rate of the moves. This capability enables it to overcome any 
difficulties in the dimension jumping moves and to explore the entire model 
space very quickly. Second, it has the capability to make use of nearby model 
information to improve its estimation. However, this can hardly be done in 
RJMCMC due to the strict requirement for its Markovian property. These 
two capabilities make SSAMC potentially more efficient than RJMCMC for 
all types of Bayesian model selection problems. This is evidenced by the 
observations: LLC showed that SAMC can make a significant improvement 
over RJMCMC for complex Bayesian model selection problems, and in this 
paper we showed that SSAMC can make a significant improvement over 
RJMCMC for simple Bayesian model selection problems. 

It is worth pointing out that although the overall performance of SAMC 
is worse than that of RJMCMC for this example, SAMC tends to work 
better than RJMCMC for the low probability model spaces, for example, 
the spaces with 7 and 14 change-points. This is due to the fact that SAMC 
samples equally from each model space, while RJMCMC samples from each 
model space proportionally to its probability. 

We have also tried a variation of SSAMC for this example. At each step, 
only multiple samples are generated, but no smoothing is operated on the 
frequency estimator; that is, (10) in the SSAMC algorithm is replaced by 
(18), 

(18) 0* = t + 7t+1 (e Xt /K-7r). 

It was run 20 times with exactly the same setting as that used by SSAMC. 
The results reported in Table 2 under the column MSAMC indicate that 
averaging over multiple samples can improve the convergence of SAMC, but 
a further smoothing operation on the frequency estimator is also important. 

Later, SSAMC was rerun with some other settings, for example, k = 10, 
To = 10 and N = 2 x 10 5 ; it yielded similar results to those reported in 
Table 2. 

5. Discussion. In this paper, we have introduced the SSAMC algorithm, 
studied its convergence and discussed its application to Bayesian model se- 
lection problems. Our numerical results show that SSAMC can converge 



16 



F. LIANG 



much faster than SAMC when the sample space is partitioned smoothly. 
For the problems for which the partition contains abrupt jumps between 
neighboring subregions, smoothing could potentially make the estimation 
worse. In the latter case, the subregions can be reordered according to the 
estimates of u^s from a pilot run such that the resulting partition is smooth. 

This paper has made two contributions to the literature. First, it estab- 
lishes the convergence of a stochastic approximation Markov chain Monte 
Carlo algorithm under verifiable conditions. The algorithm we studied is 
very general. It allows multiple samples to be generated at each itera- 
tion and a higher-order bias term to be included in the weight updat- 
ing step. The existing stochastic approximation MCMC algorithms usu- 
ally only allow a single sample to be generated at each iteration [e.g., 
Benveniste, Metivier and Priouret (1990), Tadic (1997) and 
Andrieu, Moulines and Priouret (2005)]. Younes (1999) proved convergence 
for a stochastic approximation MCMC algorithm which allows for multi- 
ple samples, but does not allow for the higher-order bias term. In addition, 
the conditions assumed by Younes (1999) are less verifiable than those as- 
sumed in this paper. Gu and Kong (1998) also studied convergence of a 
stochastic approximation MCMC algorithm under less verifiable conditions. 
As indicated by Benveniste, Metivier and Priouret (1990), Chen (2002) and 
Kushner and Ying (2003), the convergence theory established in this paper 
can be applied in a much broader context, such as signal processing and 
adaptive control. 

Second, this paper shows how to improve SAMC using smoothing meth- 
ods, which also provides a general framework on how to improve efficiency 
of Monte Carlo simulations by incorporating nonparametric techniques. For 
an illustrative purpose, we employ the Nadaraya-Watson kernel estimator. 
An advanced smoothing technique, such as the local log-likelihood estimator, 
should work better in general. It is worth noting that even when the smooth- 
ing adaptation, which can decay extremely slowly by choosing ht = aj^ and 
b a small positive number, stops, SSAMC does not become the same as 
SAMC. SAMC only allows a single sample to be generated in each iteration, 
while SSAMC allows multiple samples to be generated in each iteration. 
Also, the theory established for SAMC by LLC cannot be directly extended 
to the case of multiple samples. Allowing multiple samples to be generated in 
each iteration is important, as it provides us much freedom to incorporate 
some data-mining techniques into simulations. We hope the present work 
will trigger more research in this direction. 

LLC discussed the applications of SAMC to the problems for which the 
sample space is jointly partitioned according to two functions Xi(x) and 
A2(x). The applicability of SSAMC to these problems is apparent. For the 
joint partitions, the subregions can usually be naturally ordered as a contin- 
gency table, and the smoothing estimator used in this paper can be easily 
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extended to the table. Our preliminary results (not reported here) show that 
in this case, the superiority of SSAMC over SAMC is even more significant. 



The Appendix is organized as follows. In Section A.l, we describe a the- 
orem for the convergence of the SSAMC algorithm. In Section A. 2, we de- 
scribe a general version of the SSAMC algorithm and give conditions for 
its convergence. In Section A. 3, we prove the convergence of the general 
algorithm described in Section A. 2. In Section A. 4, we prove the conver- 
gence of the SSAMC algorithm by verifying that it satisfies the convergence 
conditions of the general algorithm. 

A.l. A convergence theorem for SSAMC. Without loss of generality, we 
only show the convergence presented in (11) for the case that all subregions 
are nonempty, that is, v = 0. Extension to the case v ^ is trivial, since 
replacing (10) by (19) (given below) will not change the process of SSAMC 
simulation: 



where v = (y, . . . , v) is an m-vector of v. 

Theorem A.l. Let E±,..., E m be a partition of a compact sample space 
X and ifj(x) be a nonnegative function defined on X with < f E . tp(x)dx < 
oo for all Ei 's. Let 7v = (tit, . . . ,7r m ) be an m-vector with < 7T; < 1 and 
Y^iL\ 7Tj = 1. Let © be a compact set of m dimensions, and there exists a con- 
stant C such that 9 G 0, where 6 = (#i, . . . , 6 m ) and 9i = C + log(/ E . ip(x) dx) — 
log(7Tj). Let 9q £ be an initial estimate of 6, and 9t € be the estimate 
of 9 at iteration t. Let {jt} be a nonincreasing, positive sequence satisfying 
(5). Let the bandwidth ht be a power function of jt, that is, ht = for 
some a > and b > 0, when t becomes large. Suppose that fe t (x) is bounded 
away from and oo on X , and the proposal distribution satisfies the local 
positive condition. As t — > oo, we have 



APPENDIX: THEORETICAL RESULTS ON SAMC 



(19) 



0' = 9 t + 7t(Pt+i -TT-i/) 



(20) 




i = 1, . . . ,m, 



where Const represents an arbitrary constant. 
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A. 2. A convergence theorem for a general stochastic approximation al- 
gorithm. Let Xi = (a^ , . . . , x^) be the collection of the samples gener- 
ated by a MH kernel at iteration t, f# t (x) be the invariant distribution of 
the MH kernel, H(8 t ,x t +i) = e^ t+1 / k — 7r, h{9) = J xk H (9, x)f#((ix), and 
&+1 = H(6 t ,xt+l) ~ h(0 t ) + 7 t T +1 ??(x t+ i) with r > and r?( x m) being a 
bounded function of xt+i; that is, there exists a constant A such that 
||??(x t +i)|| < A for all £ = 0,1, The SSAMC algorithm can then be ex- 
pressed in a more general form by replacing (10) by (21): 

(21) e* = e t + lt+l h(e t ) + lt+l i t+l . 

In the following the general stochastic approximation MCMC algorithm 
is analyzed under the following conditions. 

Conditions on the step-sizes. 

(Ai) The sequence {"ft}uLo ^ s nonincr easing, positive and satisfies the con- 
dition (5). 

Drift conditions on the transition kernel Pg. Below, we first give some 
definitions on general drift and continuity conditions, and then give the 
specific drift and continuity conditions for the SSAMC algorithm. 

Assume that a transition kernel P is ^-hreducible, aperiodic and has a 
stationary distribution on a sample space denoted by X. A set C C X is said 
to be small if there exists a probability measure v on X, a positive integer 
/ and 5 > such that 

Pg(x,A) > 5v{A) Vx G C, \/A £ B x , 

where Bx is the Borel set of X. A function V : X — > [l,oo) is said to be a 
drift function outside C if there exist constants A < 1 and b such that 

P e V{x) < XV (x) + bl(x G C) Vxe X, 

where PqV(x) = J x P${x, y)^(y) dy. For g:X — >M. d , define the norm 

II || 150*01 
\\g\\ v = sup——, 

xax V{x) 

and define the set Ly = {g : X — > M. d , \\g\\v < oo}. 

The specific drift and continuity conditions for the SSAMC algorithm can 
be described as follows. Let be the joint transition kernel for generating 
the samples x = (x^\ . . . ,x^) at each iteration by ignoring the subscript 
t, X K — X X ■ ■ ■ X X be the product sample space, A = A\ x • • • x A K be a 
measurable rectangle in X K for which Ai G Bx for i = 1, . . . , k, and Bx k = 
Bx x • • • x Bx be the cr-algebra generated by measurable rectangles. 

(A2) The transition kernel Pg is irreducible and aperiodic for any £ 0. 
There exist a function V:X K — > [1, 00) and constants a > 2 and (3 G 
(0, 1] such that: 
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(i) For any 9 G 0, there exist a set C C X K , an integer I, constants 
< A < 1, b,s, 5 > and a probability measure z/ such that 



(22) P^"(x)<Ar(x) + 6/(xGC) VxG^ K , 

(23) Pe^"(x) < sV a (x) Vx G 

(24) P^(x,A)>(Ji/(A) VxeC, VAgB^. 

(ii) There exists a constant ci such that for all x G X K and 0, 0' G 0, 

(25) \\H(6, x) || < Cl V(x), 

(26) ||tf(0,x) - ff(0',x)|| < Cl V(x)||0 - 0'f . 

(iii) There exists a constant C2 such that for all 9,9' G 0, 

(27) ||P 95 -P^||v<c 2 || 5 || V |0-^|^ V<7G£y, 

(28) \\Pe9-Pe'9\\v<C2\\g\\v''\0-0f Vg££ v «. 



Lyapunov condition on h(9). Let £ = {^6 8: = 0}. 

(A3) The function h : — > R d is continuous, and there exists a continuously 
differentiable function t> : — > [0, 00) such that v{9) = V T v(9)h(9) < 0, 
V# G C c and sup eg Q < for any compact set Q C C c . 

A main convergence result. Let P X o,6»o denote the probability measure of 
the Markov chain {(x t ,9t)}, started in (xo, 0o), and implicitly defined by the 
sequences {7*}. Also define D(z,A) = inf z / g ,4 ||z — z'||. 

Theorem A. 2. Assume the conditions (Ai), (A2) and (A3) hold, and 
sup xg ^ re V(x) < 00. Let the sequence {9 n } be defined as in the stochastic 
approximation algorithm. Then for all (xq,9q) G X k x 0, 

\imD(6 t ,C) = 0, P X0 ,e -a.e. 

t — >oo 

A. 3. Proof of Theorem A. 2. The following lemma is a partial restate- 
ment of Proposition 6.1 of Andrieu, Moulines and Priouret (2005). 

Lemma A. 1. Assume the drift condition (A2) . Then the following results 
hold: 

(Bi) For any 9 G 0, the Markov kernel P$ has a single stationary distri- 
bution {q. In addition H :0 x X K is measurable for all 8 G 0, h(9) = 
f XK H(9,x)f e (dx)<w. 

(B2) For any 9 G 0, the Poisson equation u(9,x) — ~Pqu(9,x) = H(9,x) — 
h{9) has a solution u(9,x), where P gu(9, x) = f XK u(9, x')Pg(x, x') dx' . 
There exist a function V : X K ^> [1, 00) such that the set {x G X K : V(x) < 
00} 7^ 0, constant G (0, 1], p > 2 suc/i i/iai /or any compact subset 

e ce, 



20 F. LIANG 

(i) sup eeeo \\H(6,x)\ v < oo, 

(ii) sup eeQo (||u(6',x)||y + \\P e u(0,x)\\ v ) < oo, 

(iii) su P(e ^ )G0o \e- e>\-P(\\u(e, x) - u(o, x')||v+ ||p^(0, x) - p^', 

x)||y) < oo. 

Vladislav Tadic studied the convergence of a stochastic approximation 
MCMC algorithm, which is the same as the SAMC algorithm except that 
it does not include the step of sample space partitioning, under different 
conditions from those given in Andrieu, Moulines and Priouret (2005). Tadic 
proved the following lemma, which corresponds to Theorem 4.1 and Lemma 
2.2 of Tadic (1997). In this paper, we show that the results also hold for 
SSAMC. Our proof is similar to Tadic's except for some necessary changes 
for including the higher-order noise term in £f 

Lemma A. 2. Assume that the conditions (Ai), (A2), (Bi) and (B2) hold 
and that sup xg ^ K V(x) < 00. For the SSAMC algorithm, the following results 
hold: 

(Ci) There exist M. d -valued random processes {et}t>o, {(-'t\t>o and {e'(}t>o 
defined on a probability space (£l,J-,F) such that 

(29) 7mft+i = em + 4fi + em-4'. *>0- 

(C2) The series J2u=o \\ e t II > J2t^o\\ e t\\ 2 and YHu=o \\ e t+i || 2 all converge a.s. 
and 

(30) E(e t+1 \T t ) = 0, a.s.,n>0, 

where {J-t}t>o is a family of a-algebras of T satisfying a{9o} C Tq and 

(C 3 ) Let R t = R' t + R'(, t > I, where R' t = Jt+iV T v(0 t )€t+i, and 

R"+i = f\vv(e t + s(6 t+1 -0 t ))-Vv(6 t )] T (e t+1 -6 t )ds. 
Jo 

Then J2t^=ilt^t and Y^i^t converge a.s. 
Proof. 

(Ci) Since X is compact, the condition (B2) implies that there exists a 
constant c\ S R + such that 

Pt+i - t \\ = || 7m tf(0 t ,x m ) + 7t 1 + 1 >(xt+i)|| < ci7 t+ i[V(x t+ i) + A]. 

The condition (Ai) yields 74+1/7* = 0(1) and (74+1 - j t \ = 0(7t7*+i) 
for t — > 00. Consequently, there exists a constant C2 E M + such that 

7t+i < c 27t 5 1 7m -7*1 < c 27t 2 , t > 0. 
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Let eo = e' = 0, and 

et+i = 7mK#t,x m ) - Pe t u(6 t ,x t )}, 

<4+i =7t+i[ p t+ i^(0*+i> x t+i) ~ p e t "(^,x t+ i)] 

+ {lt+2 -7t+i)Pfl t+1 u(^ + i,x t+ i) +7 t 1 + 1 r ?7(xt + i), 

e* = -7t+i p 0t^(^> x *)- 
It is easy to verify that (29) is satisfied. 
(C2) Since cr(6t) C J^, we have 

J E(u(^,x t+1 )|^)=P et n(0 t ,x t ), 

which concludes (30). The condition (B2) implies that there exist con- 
stants C3, C4, C5, eg, C7, C8 6 M + and r' = min(/3, r) > such that 

||e m || 2 <2 C 37m^ 2 (x t ), 

Ikt+ill < C47t+l^(x t+ i)||%i - + c 5 7 t 2 + i^(x t+ i) + C67ffi"A 

H4 + il| 2 <C87l+i^ 2 (xt + i). 
It follows from the condition (Ai) and the condition sup x y(x) < 00 
that the series J^t^o ll e ml| 2 5 Et^o 1 1 e i 1 1 and EtSo II e " II 2 a11 converge. 
(C3) Let M = sup 0g emax{||/i(0)||, \\Vv (0)||}, and L is the Lipschitz con- 
stant of Vu(-). Since cr{9t} C .Ft, the condition (C2) implies that 
E(V T v(0 t )e t+1 \J r t) = O. In addition, we have 
00 00 
Y,E(\V T v(6 t )e t+1 \f < M 2 Y,E(\h+if) < oo. 
t=o t=o 

It follows from the martingale convergence theorem [Hall and Heyde 
(1980), Theorem 2.15] that both £^ e m and ££0 V T -i;(6> t )e t +i con- 
verge almost surely. Since 

j2\v T v(e t y t+1 \<Mj2\Wt\\, 

t=o t=i 



00 



E7. 2 ii6ii 2 <4Eihii 2 +4EiKii 2 + 8Eii4 / n 2 , 

t=l t=l t=l t=0 

it follows from (C 2 ) that both J2Zo \V T v(9 t )e' t+1 \ and EZilt 
converge. In addition, 

KVill < L\\o t+ i - o t \\ 2 = L\\ Jt+1 h(e t )+ lt+1 ^ t+1 \ 

<2L(Af 2 7 i 2 +1 + 7t 2 +1 ||eml| 2 ), 

\(Vv(e t+1 ) - vv(e t )) T 4 +l \ < L\\e t+1 - ^m^ill, 
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for all t > 0. Consequently, 

oo 

J2\R»\ < 2LM^? + 2^E7 i 2 H6H 2 <oo, 



oo oo 



t=l t=l t=l 



oo oo \ 1/2 

I ^ I o r 2 ;i/f 2 „,2 i o r2 — 2 1 1 1 J|2 | 



J2Kv(9 t+1 )-v(e t )f e» +1 \ < h^M^I + ^^U 

i=0 V t=l i=l 



/ oo \ 1/2 

x^ElK'll 2 ) <oo. 



Since 



E^ = E e * + E4 + <-^ 
t=i t=i t=i 

E = E V T v(d t )e t+1 + e V T ^K +1 



t=o t=o t=o 



t=0 

+ V T v(e n+1 )e'^ 1 -V T v(e )el 
it is obvious that Ylt^i 7t£t an d Rt converge almost surely. 
The proof for Lemma A. 2 is completed. □ 

Based on the above lemmas, Theorem A. 2 can be proved in a similar way 
to Theorem 2.2 of Tadic (1997). Since the manuscript Tadic (1997) is not 
available publicly, we rewrite the proof to make the paper be self-contained. 

Proof of Theorem A. 2. Let M = sup 6 i ee max{||/i(6')||, and 
V £ = {6:v{6) < e}. Applying Taylor's expansion formula [Folland (1990)], 
we have 

v(9 t+1 ) = v(9 t ) + 7 n+ it)(0 t+ i) + Rt +1 , t > 0, 

which implies that 

t t t 

E H+iv{0i) = v(0 t+ i) - v(6 ) - E Ri+i > ~ 2M ~ E ^+i- 

i=0 i=0 i=0 

Since Y^l=oRi+i converges (owing to Lemma A. 2), J2i=oli+i^{^i) a l so con " 
verges. Furthermore, 

t~i t-i 
v(0 t ) = v(6 ) + E li+iv(di) + E ^+i' 1 ^ °> 

i=0 i=0 
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{v(9t)}t>o also converges. On the other hand, the conditions (Ai) and (A2) 
imply lim ^^ d(9t,C) = 0. Otherwise, there exists e > and no such that 
d(0 t ,£) > e, t > n ; as E^iTt = 00 and p = sup{i)(9) :9 G V e c } < 0, it is 
obtained that J2t^n lt+iv(0 t ) <pE^i7m = -00. 

Suppose that lim t -+oo d(9 t , C) > 0. Then, there exists e > such that 
h^ t ^ 00 d(9 t ,£) > 2s. Let t = inf{t > 0:d(9 t ,£) > 2e}, while = inf{t > 
t fc : d(0 t , £) < e} and t k+1 = M{t > t' k : d(9 u C) > 2e}, k > 0. Obviously, t k < 
tk> < tfc+i, k > 0, and 

d(9 tk ,£)>2e, d(e t , k ,£)<£, d(0 t ,C)>e, t k < t < t' k , k > 0. 
Let g = sup{u(6») : 9 G V e c }. Then 

fc=0 A;=0 i=i fc t=0 

t' -1 t' — 1 

Therefore, EfcioEjlt fe Ji+i < 00, and consequently, lim fc ^oo Y,i=t k 7i+i = °- 
Since Et^i7t£t converges (owing to Lemma A. 2), we have 



e< \\9 t > k -9 tk \\ <M J2 7i+1 + 



X 7i+i&+i 
i=t k 



as A; — > 00. This contradicts our assumption e > 0. Hence, lim^oo d(9t, C) > 
does not hold. Therefore, \\m. t ^ oa d{9 t ,C) = almost surely. □ 

A.4. Proof of Theorem A.l. Let e Xt = (e^, . . . , ej m ). Since the kernel 
used in (7) has a bounded support, pu — eu/n can then be re-expressed as 

. _ _ SaSffiU ^(A*/(™fr))(<W« - eg/j 

where k$ = [ Cv ^ lt \, and [2] denotes the maximum integer less than z. By 
noting that —1 < ^ — < +1, we have \pu — < 2&o. This is true even 

when k = 0. Thus, there exists a bounded function —2Cm/k < i]*(e Xt ) < 
2Cm/A such that 

(32) pu - e u /n = h t r]*(e Xt ). 

Since ht is chosen in (9) as a power function of 7$, the SSAMC algorithm 
falls into the class of stochastic approximation MCMC algorithms described 
in Section A. 2 by letting r](x t ) = (??i(e Xt ), . . . ,r/^(e Xt )), and its convergence 
can be proved by verifying that it satisfies the conditions (Ai) to (A3): 

( Ai ) It is obvious that this condition is satisfied by the sequence as specified 
in (6). 
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(A2) Let x t+ i = (sCi + i, ■ • . ,aJt"i), which can be regarded as a sample pro- 
duced by a Markov chain on the product space X K = X x • • • x X with 
the kernel 

Pg t (x, y ) = Pg t (*("> , j, (D ) P„ t (y « , y ( 2 ) ) • • • P 0< (y^ 1 ) , y W ) , 

where Pg t y) denotes the one-step MH kernel. To simplify notations, 
in the following we will drop the subscript t, denoting xj by x and 
Ot = {Oti, ■ • •j^tm) by 6 = ... , m ). 

Roberts and Tweedie (1996, Theorem 2.2) showed that if the target dis- 
tribution is bounded away from and 00 on every compact set of its support 
X, then the MH chain with a proposal distribution satisfying the local posi- 
tive condition is irreducible and aperiodic, and every nonempty compact set 
is small. It follows from this result that Pg(x,y) is irreducible and aperiodic, 
and thus Pg(x,y) = Pg(x,y) is also irreducible and aperiodic. 

Since X is compact, Roberts and Tweedie's result implies that X is a 
small set and the minorization condition holds on X for the kernel Pg(x,y); 
that is, there exist an integer a constant 5 and a probability measure z/(-) 
such that 

P l e \x,A)>8v'(A) \fx G X, \/A G Bx- 

It then follows from Rosenthal (1995, Lemma 7) that 

P&(x,A)>&/(A) Vxer, VAgB^, 

by setting I = min{n:n x k > V, n = 1, 2, 3, . . .} and defining the measure 
v{-) as follows: Marginally on the first coordinate, v{-) agrees with f'(-); 
conditionally on the first coordinate, u{-) is defined by 

(33) u{x^ 2 \ . . . , X W\xW) = W(x {2 \. . . ,xW|xW), 

where W(a^ . . . , arM^ 1 )) is the conditional distribution of the Markov 
chain samples generated by the kernel Pg. Conditional on x^ \ the samples 
generated independent of all previous samples x t _i, . . . ,xi. 
Hence, W(x^ 2 \ . . . , x^\x^) exists. This verifies condition (24) by setting 
C = X K . Thus, for any 9 G O the following conditions hold: 

PeF Q (x)<Ar(x) + 6/(xGC) VxG**, 

(34) 

PeV a {x) <?y Q (x) VxG^ K , 

by choosing V(x) = 1, 0<A< 1,6 = 1 — A, ? > 1, and a > 2. These conclude 
that (A2)(i) is satisfied. 

Let (# ; x) be the ith component of the vector (0, x) = (e x / k — tt). By 

construction, |ii"W(0,x)| = |ex /« — vrj| < 1 for all x G ^ K and i = 1, . . . ,m. 
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Therefore, there exists a constant c\ = \fm such that for any 9 G and all 

(35) \\H{e,x)\\<c x . 

Also, H(9,x) does not depend on 9 for a given sample x. Hence, H(9,x) — 
H(9',x) = for all (6, 6') G x 0, and the following condition holds for the 
SSAMC algorithm: 

(36) \\H(0,x) - H(9',x)\\ < ci||0 - 9% 

for all (9, 9') G x 0. Equations (35) and (36) imply that (A2)(ii) is satisfied 
by choosing (3 = 1 and V(x) = 1. 

Let so(x,y) = q(x,y)mm{l,rg(x,y)}, where rg(x,y) = ^Mtt^I - Thus, 



we have 



www ' 



ds e (x,y) 



d9i 



\-q(x,y)I(r e (x,y) < 1) 
x /(J(s) =i or J(y)=i)I(J(x) ± J(y))r e (x,y)\ 

< q(x,y), 

where J(-) is the indicator function, and J{x) denotes the index of the sub- 
region to which x belongs. The mean- value theorem implies that there exists 
a constant C2 such that 

(37) \se(x,y) - s e >{x,y)\ < q(x,y)c 2 \\9 - 9'\\, 
which implies that 

(38) sup / \s e (x,y) - sg>(x,y)\ dy < c 2 \\9 - 9'\\. 

x Jx 

Since the MH kernel can be expressed in the form 



Pg(x, dy) = s g (x, dy) + I(x G dy) 

for any measurable set A C X we have 
\P e (x,A)-P e >(x,A)\ 



1 



x 



sq(x, z) dz 



)(x,y) - s e >(x,y) 



dy 



(39) 



I(x G A) [ 
Jx 



dz 



9'(x,z) - sg(x,z) 

< J \sg(x,y) - S0>(x,y)\ dy + I(x G A) 

<2 / \s e (x,y) - s 0/ (x,y)\dy 
Jx 

<2c 2 \\9-9'\\. 



x 
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Since Pg(x, A) can be expressed in the following form: 
P<?(x,A) 

P e (x( K \yW)P f) (y( 1 \y®)---P () (y( K - 1 \yW)dyM---dyW, 



lAi JA K 

(39) implies that there exists a constant C3 such that 
|P fl (x,A)-P fl ,(x,A)| 



Ax 



< 



Ai JX JX 



[P e (xW,yW) 

x^(!/ (1) 1 !/ (2) )'-^(!/ M 1 !/ W ) 
-P e ,(x W ,2/ (1) )^(y (1) ,y (2) )--- 

xP e ,(y( K - 1 ),yW)]dy( 1 )---d 2 /W 

|P e (xW,y«)-P^(xW,y( 1 ))| 

xnd/ 11 ',^)-^^" 1 ',!/'^!/ 11 ''-^ 

p e ,(x (K) , y (1) )|P,( y (1) ,2/ (2) )-^(y (1) ,y (2) )l 

x P e (yW , ) • • • Peiy^, y W) dy« • • • dj/M 



+ 
+ 



A" J A" J A 



P 9 /(xW )!/ «)...p y (y ( ' s - 2) ,y ( ' t - 1) ) 

x |P 9 (y^,i/W) - P 6 > (y M , I dy^ • • • dyW 



<c 3 \\e-e'\\, 

which implies that (27) is satisfied. 
For any function g £ Cy, 

\\P e g-Pe'g\\v-- 



(P fl (x,dy)-P fl ,(x,dy))g(y) 

/ (P fl (x,dy)-P fl /(x,dy))g(y) 

ja^ 

+ / (P e (x,dy)-P e /(x,dy)) 5 (y) 
Jx k 



< \\g\\ v {\P e (x,XZ) ~ P*'(x,^)| + |P«(x,^) - Py(x,^)|} 

< 4c 2 ||s||y|0 - 0'| [following from (39)] 
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where X* = {y : y G X R , P e (x, dy ) - P g > (x, y ) > 0} and XI = X K \ X£ . There- 
fore, condition (A2) (iii) is satisfied by choosing V(x) = 1 and (3 = 1. 

(A3) Since the invariant distribution of the kernel Pq(x, •) is fe(x), we have 
for any fixed 9, 

by (0 / \ - h^)^/^ _ 

(40) , 

where 5j = Jg. VK^) dx/e 9i and 5 1 = SfcLi Thus, we have 

h(6)= f ff(0,x)/(dx) 
Jx 

-^--TTl,..., — - 7T„ 

It follows from (40) that /i(#) is a continuous function of #. Let t>(#) = 
^Z)™=i( ; 5 L — ^fc) 2 - As shown below, v(9) has continuous partial derivatives 
of the first order. 

Solving the system of equations formed by (40), we have 

£ = j(#l> • • • ; &m) '■ 

0i = Const + log ^ ip(x) dx^J — log(7Tj),i = 1, . . . ,m; 9 € g|, 

where Const = log(S') can be determined by imposing a constraint on S. For 
example, setting 5 = 1 leads to that c = 0. It is obvious that C is nonempty 
and v(9) = for every 9 £ jC. 

To verify the conditions related to v(9), we have the following calculations: 

di = ds 1 

d9i 89, 

dSt _ dSj _ 
d(h~~d9~~ ' 

(41) 

d (St S) # f s> 



d9i S\S 

d(Sj/S) _ d(Sj/S) _ SjSj 
39 j 89i S 2 ' 
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for i,j = 1, . . . ,m and i / j, 

dv{6) _ 1 " d(S k /S - 7T k ) 2 

89i 2 ^ dBi 



E( Sj \ SiSj f Si \ Si ( s. 



III 



Si \ SiSj ( Si \ Si 



S 3 S 2 \S 1 S 



TT, 



Si f Si \ s. 



(42) 



^ s \s S ' 

for i = 1, . . . , m, where n^* = Y^=i(~^ ~ ^i)"^"- Thus, we have 

(| -*<)§- £(§-"<) I 

(43) -{Kf-.) 2 !-,?. 1 



= -4 < o, 

where u^* denotes the variance of the discrete distribution defined in the 
following table: 



State (77*) 
Prob. 



— 7Tl 

s 



If 6* E £, = 0; otherwise, < 0. Therefore, sup 0g Q v{9) < for any 
compact set Q C C c . 
The proof is completed. 
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